An Effective High-Performance Multiway Spatial Join Algorithm with Spark
Abstract
Multiway spatial join plays an important role in GIS (Geographic Information Systems) and their applications. As spatial data volumes grow, multiway spatial join has become a computational bottleneck in big data settings. Parallel and distributed computing platforms, such as MapReduce and Spark, are promising for handling this computational load. Previous approaches have focused on developing single-threaded join algorithms as an optimization and partitioning strategy for parallel computing. In this paper, we present an effective high-performance multiway spatial join algorithm with Spark (MSJS) to overcome the multiway spatial join bottleneck. MSJS handles the problem through cascaded pairwise join. Using the power of Spark, the formerly inefficient cascaded pairwise spatial join is transformed into a high-performance approach. Experiments using massive real-world data sets show that MSJS outperforms the existing parallel approaches to multiway spatial join described in the literature.
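To make the idea of a cascaded pairwise join concrete, the following is a minimal single-machine sketch in plain Python, not the MSJS implementation and not Spark code. It assumes that each object is reduced to an axis-aligned bounding box and that the join predicate is "all objects share a common overlap region"; the abstract specifies neither. The names `cascaded_join`, `intersects`, and `intersection` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: every spatial object is represented by its bounding box
# (xmin, ymin, xmax, ymax). Real geometries would need a refinement step.
Box = Tuple[float, float, float, float]


def intersects(a: Box, b: Box) -> bool:
    """Return True if two boxes overlap (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def intersection(a: Box, b: Box) -> Box:
    """Return the overlap region of two boxes that are known to intersect."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def cascaded_join(datasets: List[Dict[str, Box]]) -> List[Tuple[Tuple[str, ...], Box]]:
    """Join N datasets by chaining N-1 pairwise joins.

    Each intermediate result keeps the ids matched so far together with
    their common overlap region, which is then joined with the next dataset.
    """
    if not datasets:
        return []
    # Seed the cascade with the first dataset.
    partial = [((key,), box) for key, box in datasets[0].items()]
    for dataset in datasets[1:]:
        # Naive nested-loop pairwise join; a real system would partition
        # and index the data instead of comparing every pair.
        partial = [
            (ids + (key,), intersection(region, box))
            for ids, region in partial
            for key, box in dataset.items()
            if intersects(region, box)
        ]
    return partial
```

Each pass shrinks or grows the intermediate result before the next dataset is joined, which is why the size of the intermediate results largely determines the cost of a cascaded approach.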
Similar resources
A New Design of High-Performance Large-Scale GIS Computing at a Finer Spatial Granularity: A Case Study of Spatial Join with Spark for Sustainability
Sustainability research faces many challenges as respective environmental, urban and regional contexts are experiencing rapid changes at an unprecedented spatial granularity level, which involves growing massive data and the need for spatial relationship detection at a faster pace. Spatial join is a fundamental method for making data more informative with respect to spatial relations. The drama...
LocationSpark: A Distributed In-Memory Data Management System for Big Spatial Data
We present LocationSpark, a spatial data processing system built on top of Apache Spark, a widely used distributed data processing system. LocationSpark offers a rich set of spatial query operators, e.g., range search, kNN, spatio-textual operation, spatial-join, and kNN-join. To achieve high performance, LocationSpark employs various spatial indexes for in-memory data, and guarantees that immu...
Multiway Equijoin Query Acceleration Using Hit-Lists
This paper presents a new data structure for multiway and general join query acceleration, the hit-list, and an algorithm for its use. The hit-list is a surrogate index providing the mapping between the values of two attributes in a relation participating in an equijoin or a selection. The results of an analytical model, simulation study, and an implementation are presented. The performance adv...
GeoSpark: A Cluster Computing Framework for Processing Spatial Data
This paper introduces GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data. GeoSpark consists of three layers: Apache Spark Layer, Spatial RDD Layer and Spatial Query Processing Layer. Apache Spark Layer provides basic Spark functionalities that include loading / storing data to disk as well as regular RDD operations. Spatial RDD Layer consists of three nove...
Partition Based Spatial-Merge Join (to appear in SIGMOD 1996)
This paper describes PBSM (Partition Based Spatial–Merge), a new algorithm for performing spatial join operation. This algorithm is especially effective when neither of the inputs to the join have an index on the joining attribute. Such a situation could arise if both inputs to the join are intermediate results in a complex query, or in a parallel environment where the inputs must be dynamicall...
Journal title:
- ISPRS Int. J. Geo-Information
Volume 6 Issue
Pages -
Publication date 2017